met_plot <- read_delim("/Users/aleya/Library/CloudStorage/OneDrive-cumc.columbia.edu/Coursework/Data Science I/data/MetObjects.txt") %>% ## change this to a local call later
janitor::clean_names() %>%
select(object_id, object_name, title, accession_year,
culture, period, department, is_highlight,
geography_type, city, state, county, country, region, subregion)
met_plot <- sample_n(met_plot, 5000) ## decide how much of the data to include, if too heavy
Note that the object_name variable is often more detailed than
necessary. Here, I collapse it into more general categories of
objects.
met_plot <- met_plot %>%
mutate(object_name = ifelse(
grepl("Textile", object_name), "Textile",
ifelse(grepl("Painting", object_name), "Painting",
ifelse(grepl("Relief", object_name), "Relief",
ifelse(grepl("Print", object_name), "Print",
ifelse(grepl("aseball card", object_name), "Baseball card",
ifelse(grepl("Vase", object_name), "Vase",
ifelse(grepl("rnament", object_name), "Ornament",
ifelse(grepl("arring", object_name), "Earring",
ifelse(grepl("ecklace", object_name), "Necklace",
ifelse(grepl("hotograph", object_name), "Photograph",
ifelse(grepl("tatue", object_name), "Statue",
object_name))))))))))))
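The nested ifelse() chain above can be hard to extend and easy to mis-edit. As a minimal sketch of the same first-match-wins recoding, here is a dplyr::case_when() version on a small invented tibble. The toy data and the new column name object_group are assumptions for illustration, not part of the Met data or the analysis above.

```r
library(dplyr)

# Toy stand-in for met_plot; these object names are invented for illustration
toy <- tibble(
  object_name = c("Textile fragment", "Baseball card", "Earring", "Ornament", "Bowl")
)

# case_when() checks conditions top to bottom and uses the first match,
# which mirrors the order of the nested ifelse() calls
toy_recoded <- toy %>%
  mutate(object_group = case_when(
    grepl("Textile", object_name)      ~ "Textile",
    grepl("aseball card", object_name) ~ "Baseball card",
    grepl("rnament", object_name)      ~ "Ornament",
    grepl("arring", object_name)       ~ "Earring",
    TRUE                               ~ object_name  # keep unmatched names as-is
  ))

print(toy_recoded)
```

Each pattern sits on its own line next to its label, so a mismatch between the two is easier to spot than in the nested form.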
Note: the geography variables have substantial missingness. We may
want to limit the data to rows with geographic data, but that would
give a biased picture. The variable accession_year, however,
has high completeness, and culture has moderate
completeness. Here is the proportion of rows with missing values for
each selected column:
sapply(met_plot, function(x) mean(is.na(x))) ## proportion missing; no hardcoded row count
## object_id object_name title accession_year culture
## 0.0000 0.0042 0.0670 0.0088 0.5550
## period department is_highlight geography_type city
## 0.8010 0.0000 0.0000 0.8708 0.9310
## state county country region subregion
## 0.9952 0.9834 0.8346 0.9330 0.9528
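One way to gauge how biased a geography-only subset would be is to check whether geographic completeness differs across departments. The sketch below does this on a small invented tibble. The rows are made up for illustration, and prop_with_country is a hypothetical column name.

```r
library(dplyr)

# Toy stand-in for met_plot; rows are invented for illustration
toy <- tibble(
  department = c("Asian Art", "Asian Art",
                 "Drawings and Prints", "Drawings and Prints", "Drawings and Prints",
                 "Egyptian Art"),
  country    = c("China", NA, NA, NA, NA, "Egypt")
)

# Share of rows per department that have a non-missing country
geo_by_department <- toy %>%
  group_by(department) %>%
  summarize(n = n(),
            prop_with_country = mean(!is.na(country)),
            .groups = "drop")

print(geo_by_department)
```

If the proportions vary widely, filtering to rows with a country would over-represent some departments, which is the bias noted above.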
mypal<-c("#78B7C5", "#EBCC2A", "#FF0000", "#EABE94",
"#3B9AB2", "#B40F20", "#0B775E", "#F2300F",
"#5BBCD6", "#F98400", "#ab0213", "#E2D200",
"#ff7700", "#46ACC8", "#00A08A", "#78B7C5",
"#a7ba42", "#f94f8a", "#DD8D29")
met_plot %>%
group_by(department, accession_year) %>%
summarize(n = n()) %>%
plot_ly(x = ~accession_year, y = ~n,
color = ~department,
type = 'scatter',
mode = 'lines+markers',
colors = mypal) %>%
layout(showlegend = FALSE)
## `summarise()` has grouped output by 'department'. You can override using the
## `.groups` argument.
Could also do this by year, or by culture.
top_5 <- met_plot %>%
group_by(department, accession_year) %>%
count(object_name) %>%
filter(n > 1)
plot2 <- ggplot(top_5, mapping = aes(x = accession_year, y = n,
color = department, size = 100*n,
label = department, label2 = object_name, label3 = n)) +
geom_jitter(alpha=0.5, position = position_jitter(width = 0.4)) +
scale_color_manual(values = mypal) +
theme(axis.text.x = element_text(angle = 35, hjust = 1),
legend.position = "none")+
labs(title="Popular objects by department, over time",
y = "Number of objects", x = "Accession year")
plotly2 <- ggplotly(plot2)
plotly2
Will play with this next. Need to merge the country names with a country-level shapefile.
met_plot %>%
filter(!is.na(country)) %>%
leaflet() %>%
addTiles()
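As a rough sketch of the merge step, the code below counts objects per country and joins the counts onto a lookup table. The country_shapes tibble is a hypothetical stand-in for the attribute table of a country-level shapefile, with assumed column names name and iso_a3; the toy rows are invented. With a real shapefile read into an sf object, the same join should carry over, but that is not shown here.

```r
library(dplyr)

# Toy stand-in for met_plot; values are invented for illustration
toy <- tibble(country = c("Egypt", "Egypt", "France", NA, "Persia"))

# Hypothetical stand-in for a shapefile's attribute table
country_shapes <- tibble(
  name   = c("Egypt", "France", "Iran"),
  iso_a3 = c("EGY", "FRA", "IRN")
)

# Objects per country, dropping rows with no country recorded
country_counts <- toy %>%
  filter(!is.na(country)) %>%
  count(country, name = "n_objects")

# Keep every shape; countries with no objects get NA counts
joined <- country_shapes %>%
  left_join(country_counts, by = c("name" = "country"))

# Country names in the data with no matching shape, to be cleaned by hand
unmatched <- country_counts %>%
  anti_join(country_shapes, by = c("country" = "name"))

print(joined)
print(unmatched)
```

The anti_join() check matters because museum records often use historical or variant country names that will not match a modern shapefile exactly.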